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MEASURING REPRODUCIBILITY OF HIGH-THROUGHPUT 

EXPERIMENTS 1 
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University of California at Berkeley 

Reproducibility is essential to reliable scientific discovery in high- 
throughput experiments. In this work we propose a unified approach 
to measure the reproducibility of findings identified from replicate 
experiments and identify putative discoveries using reproducibility. 
Unlike the usual scalar measures of reproducibility, our approach cre- 
ates a curve, which quantitatively assesses when the findings are no 
longer consistent across replicates. Our curve is fitted by a copula 
mixture model, from which we derive a quantitative reproducibility 
score, which we call the "irreproducible discovery rate" (IDR) anal- 
ogous to the FDR. This score can be computed at each set of paired 
replicate ranks and permits the principled setting of thresholds both 
for assessing reproducibility and combining replicates. 

Since our approach permits an arbitrary scale for each replicate, 
it provides useful descriptive measures in a wide variety of situations 
to be explored. We study the performance of the algorithm using 
simulations and give a heuristic analysis of its theoretical proper- 
ties. We demonstrate the effectiveness of our method in a ChlP-seq 
experiment. 

1. Introduction. High-throughput profiling technologies play an indis- 
pensable role in modern biology. By studying a large number of candidates 
in a single experiment and assessing their significance using data analyti- 
cal tools, high-throughput technologies allow researchers to effectively select 
potential targets for further studies. Despite their ubiquitous presence in 
biological research, it is known that any single experimental output from 
a high-throughput assay is often subject to substantial variability. Repro- 
ducibility of high-throughput assays, such as the level of agreement between 
results from replicate experiments across (biological or technical) replicate 
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samples, test sites or experimental or data analytical platforms, is a constant 
concern in their scientific applications [e.g., MAQC consortium (2006) in mi- 
croarray experiments, Park (2009) in ChlP-seq technology] . Metrics that ob- 
jectively assess the reproducibility of high-thoughput assays are important 
for producing reliable scientific discoveries and monitoring the performances 
of data generating procedures. 

An important criterion for assessing reproducibility of results from high- 
throughput experiments is how reproducibly top ranked signals are reported 
in replicate experiments. These signals and their significance scores, often 
presented as the primary results to be accessed by downstream steps, are 
critical for prioritizing follow-up studies. A common approach to assess this 
reproducibility is to compute the Spearman's pairwise rank correlation co- 
efficient between the significance scores for signals that pass a prespecified 
significance threshold on each replicate [see MAQC consortium (2006) and 
Kuo et al. (2006) for examples in microarray studies]. However, the Spear- 
man's correlation coefficient actually is not entirely suitable for measuring 
the reproducibility between two rankings in this type of application. First, 
this summary depends on the choice of significance thresholds and may ren- 
der false assessments that reflect the effect of thresholds rather than the 
data generating procedure to be evaluated. For instance, with everything 
else being equal, stringent thresholds generally produce higher rank corre- 
lations than relaxed thresholds when applied to the same data. Although 
standardizing thresholds in principle can remove this confounding effect, 
calibration of scoring systems across replicate samples or different meth- 
ods is challenging in practice, especially when the scores or their scales 
are incomparable on replicate outputs. Though this difficulty seemingly is 
associated only with heuristics-based scores, indeed, it is also present for 
probabilistic-based scores, such as p-values, if the probabilistic model is ill- 
defined. For example, it has been reported in large-scale systematic analyses 
that strict reliance on p-values in reporting differentially expressed genes 
causes an apparent lack of inter-platform reproducibility in microarray ex- 
periments [MAQC consortium (2006)]. Second, the rank correlation treats 
all ranks equally, though the differences in the top ranks seem to be more 
critical for judging the reproducibility of findings from high-throughput ex- 
periments. Alternative measures of correlation that give more importance 
to higher ranks than lower ones, for instance, by weighing the difference of 
ranks differently, have been developed in more general settings [e.g.. Blest 
(2000); Genest and Plante (2003); da Costa and Soares (2005)] and applied 
to this application [see Boulesteix and Slawski (2009) for a review]. However, 
all these measures are also subject to prespecified thresholds and raise the 
question of how to select the optimal weighing scheme. 

In this work we take an alternative approach to measure the reproducibil- 
ity of results in replicate experiments. Instead of depending on a prespecified 
threshold, we describe reproducibility as the extent to which the ranks of the 
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signals are no longer consistent across replicates in decreasing significance. 
We propose a copula-based graphical tool to visualize the loss of consistency 
and localize the possible breakdown of consistency empirically. We further 
quantify reproducibility by classifying signals into a reproducible and an ir- 
reproducible group, using a copula mixture model. By jointly modeling the 
significance of scores on individual replicates and their consistency between 
replicates, our model assigns each signal a reproducibility index, which esti- 
mates its probability to be reproducible. Based on this index, we then define 
the irreproducible discovery rate (IDR) and a selection procedure, in a fash- 
ion analogous to their counterparts in multiple testing, to rank and select 
signals. As we will illustrate, the selection by this reproducibility criterion 
provides the potential for more accurate classification. The overall repro- 
ducibility of the replicates is described using IDR as the average amount of 
irreproducibility in the signals selected. 

The proposed approach, indeed, is a general method that can be applied 
to any ranking systems that produce scores without ties, though we discuss 
it in the context of high-throughput experiments. Because our copula-based 
approach does not make any parametric assumptions on the marginal distri- 
butions of scores, it is applicable to both probabilistic- and heuristic-based 
scores. When a threshold is difficult to determine in a scoring system, for 
example, heuristic-based scores, it provides a reproducibility-based criterion 
for setting selection thresholds. 

In the next section we present the proposed graphical tool (Section 2.1), 
the copula mixture model and its estimation procedure (Section 2.2), and 
the reproducibility criterion (Section 2.3). In Section 3 we use simulations 
to evaluate the performance of our model, and compare with some existing 
methods. In Section 4 we apply our method to a data set that motivated this 
work. The data set was generated by the ENCODE consortium [ENCODE 
Project Consortium (2004)] from a ChlP-seq assay, a high-throughput tech- 
nology for studying protein-binding regions on the genome. The primary in- 
terest is to assess the reproducibility of several commonly used and publicly 
available algorithms for identifying the protein-binding regions in ChlP-seq 
data. Using this data, we compare the reproducibility of these algorithms 
in replicate experiments, infer the reliability of signals identified by each 
algorithm, and demonstrate how to use our method to identify subopti- 
mal results. Section 5 is a general discussion. Finally, we present a heuristic 
justification of our algorithm on optimality grounds in the supplementary 
materials [Li et al. (2011)]. 

2. Statistical methods. The data that we consider consist of a large num- 
ber of putative signals measured on very few replicates of the same under- 
lying stochastic process, for example, protein binding sites identified on the 
genomes of biological replicates in ChlP-seq experiments. We assume that 
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each putative signal has been assigned a score that relates to the strength of 
the evidence for the signal to be real on the corresponding replicate by some 
data analysis method. The score can be either heuristic based (e.g., fold 
enrichment) or probabilistic based (e.g., value). We further assume that 
all the signals are assigned distinct significance scores and that the signifi- 
cance scores reasonably represent the relative ranking of signals. However, 
the distribution and the scale of the scores are unknown and can vary on 
different replicates. We assume without loss of generality that high scores 
represent strong evidence of being genuine signals and are ranked high. By 
convention, we take the "highest" rank to be 1 and so on. We shall use the 
scores as our data. 

We assume n putative signals are measured and reported on each repli- 
cate. Then the data consist of n signals on each of the m replicates, with the 
corresponding vector of scores for signal i being (xj.i, . . . ,Xi^m)- Here Xij is 
a scalar score for the signal on replicate j. Our goal is to measure the repro- 
ducibility of scores across replicates and select reliable signals by considering 
information on the replicates jointly. In what follows, we focus on the case 
of two replicates (i.e., m = 2), although the methods in this paper can be 
extended to the case with more replicates (see supplementary materials [Li 
et al. (2011)]). 

If replicates measure the same underlying stochastic process, then for 
a reasonable scoring system, the significance scores of genuine signals are 
expected to be ranked not only higher but also more consistently on the 
replicates than those of spurious signals. When ranking signals by their sig- 
nificance scores, a (high) positive association is expected between the ranks 
of scores for genuine signals. A degradation or a breakdown of consistency be- 
tween ranks may be observed when getting into the noise level. This change 
of association provides an internal indicator of the transition from real sig- 
nal to noise. We will use this information in measuring the reproducibility 
of signals. 

In this section we first present a graphical tool (Section 2.1) for visualizing 
the change of association and localizing the possible breakdown of associa- 
tion, empirically without model assumptions. We then present a model-based 
approach (Section 2.2), which quantifies the heterogeneity of association and 
leads to a reproducibility criterion for threshold selection. 

2.1. Displaying the change of association. As we mentioned, the bivari- 
ate association between the significance scores is expected to be positive for 
significant signals, then transits to zero when getting into noise level. By 
visualizing how association changes in the decreasing order of significance, 
one may localize the transition of association and describe reproducibility 
in terms of how soon consistency breaks down and how much empirical 
consistency departs from perfect association before the breakdown. 
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Rank-based graphs are useful tools for displaying bivariate dependence 
structure, because they are invariant with respect to monotone transforma- 
tions of the variables and are thus scale free. Earlier papers have proposed 
rank-based graphical tools, such as the Chi-plot [Fisher and Switzer (1985, 
2001)] and the Kendall plot [Genest and Boies (2003)], for visualizing the 
presence of association in samples from continuous bivariate distributions. 
Related to nonparametric tests of independence, these graphs primarily are 
designed for detecting bivariate dependence by representing the presence of 
association as departures from the pattern under independence. The type 
and the level of simple bivariate association may be inferred by comparing 
the patterns of dependence observed in these plots with the prototypical 
patterns in Fisher and Switzer (1985, 2001), Genest and Boies (2003). How- 
ever, these graphs are less informative, when heterogeneity of association, 
such as the one described here, is present. (See Figure 2 in Section 2.1.2 for 
an illustration on a real data set with mixed populations.) 

We now present our rank-based graph, which we refer to as a correspon- 
dence curve, intended to explicitly display the aforementioned change of 
association. 

2.1.1. Correspondence curves. Let (^i^i , Xi 2)j • ■ • > (-^n,i > ^n.2) be a sam- 
ple of scores of n signals on a pair of replicates. Define 

1 " 

(2.1) '^nit,v) = -^l(Xj,i > X(p(i_j)„])^i,Xi,2 > a;(p(i_^,)„]),2), 

^ i=l 

0<t<l,0<w<l, 

where 1 and X(|-(i_„)„-|)^2 denote the order statistics of Xi and X2, 

respectively. ^'n(t, v) essentially describes the proportion of the pairs that are 
ranked both on the upper t% of Xi and on the upper u% of X2, that is, the 
intersection of upper ranked identifications. As consistency usually is deemed 
a symmetric notion, we will just focus on the special case oit = v and use the 
shorthand notation ^n(i) in what follows. In fact, ^nit,v) is an empirical 
survival copula [Nelson (1999)], and ^'n(i) is the diagonal section of ^'„(t,u) 
[Nelson (1999)]. (See Section 2.2.1 for a brief introduction of copulas.) Define 
the population version ^{t) = lim„ ^'n(t). Then ^{t) and its derivative ^'{t), 
which represent the change of consistency, have the following properties. (See 
supplementary materials [Li et al. (2011)] for derivation.) 

Let R{Xi^i) and R{Xi^2) be the ranks of Xi^i and Xi^2-, respectively. 

(1) If R{Xi^i) = R{X,^2) for G (^-^(1 - t),F-\l - io)], j = 1,2, with 
< to < i < 1, *(i) = ^(^o) + t-tQ and ^''(t) = 1. 

(2) If i?(X,,i) ± R{X,,2) for Xi^j G (^-^(l - t),F-\l)lj = 1, 2, with < 
t < 1, ^^{t) = and ^''(t) = 2t. 
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(3) If R{Xi^i) = R{Xi,2) for X^,, G {Fr\l - to),Fr\l)] and R{Xi) ± 
R{X2) for Xij G (Fri(0),Fri(l - to)],j = 1,2, with < to < 1, then for 
to < t < 1, ^(0 = and ^'(t) = 

The last case describes an ideahzed situation in our applications, where 
all the genuine signals are ranked higher than any spurious signals, and 
the ranks on the replicates are perfectly correlated for genuine signals but 
completely independent for spurious signals. The same properties are ap- 
proximately followed in the corresponding sample version and with 
finite differences replacing derivatives. 

To visualize the change of consistency with the decrease of significance, 
a curve can be constructed by plotting the pairs [or (t,^'^(t))] 

for < t < 1. The resulting graphs, which we will refer to as a correspon- 
dence curve (or a change of correspondence curve, resp.), depend on i 
and Xi^2 only through their ranks, and are invariant to both location and 
scale transformation on Xi^i and Xi^2- Corresponding to the three special 
cases described earlier, the curves have the following patterns: 

(1) When R{Xi^i) and R{Xi^2) are perfectly correlated for i = l,...,n, 
all points on the curve of will fall on a straight line of slope 1, and all 
points on the curve of will fall on a straight line with slope 0. 

(2) When R[Xi^i) and R[Xi^2) are independent for i = 1, . . . , n, all points 
on the curve of will fall on a parabola t^, and all points on the curve 
of fall on a straight line of slope of 2t. 

(3) When R{Xi^i) and R{Xi^2) are perfectly correlated for the top tQn 
observations and independent for the remaining (1 — t{))n, top ton points fall 
into a straight line of slope 1 on the curve of ^„ and slope on the curve 
of "^'n, and the rest (1 - tQ)n points fall into a parabola ^'„(t) = * 

(t > to) on the curve of and a straight line of slope on the curve 

of 

These properties show that the level of positive association and the possi- 
ble change of association can be read off these types of curves. For the curve 
of ^'n, strong association translates into a nearly straight line of slope 1, 
and lack of association shows as departures from the diagonal line, such as 
curvature bending toward the x-axis [i.e., ^n{t) <t\] if almost no associa- 
tion is present, the curve shows a parabolic shape. Similarly, for the curve 
of strong association translates into a nearly straight line of slope 0, and 
lack of association shows as a line with a positive slope. The transition of 
the shape of the curves, if present, indicates the breakdown of consistency, 
which provides guidance on when the signals become spurious. 

2.1.2. Illustration of the correspondence curves. We first demonstrate 
the curves using an idealized case (Figure 1), where R{X) and R{Y) agree 
perfectly for the top 50% of observations and are independent for the re- 
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Fig. 1. An illustration of the correspondence profile in an idealized case, where top 50% 
are genuine signals and bottom 50% are noise. In this case, all signals are ranked higher 
than noise; two rank lists have perfect correspondence for signals and no correspondence 
for noise, (a) Correspondence curve, (b) Change of correspondence curve. 



maining 50% of observations. The curves display the pattern described in 
case 3 above. The transition of the shape of the curves occurs at 50%, which 
corresponds to the occurence of the breakdown of consistency. Transition 
can be seen more visibly on the curve of by the gap between the disjoint 
lines with and positive slopes, which makes a better choice for inspect- 
ing and localizing the transition than especially when the transition is 
less sharp. More simulated examples are presented in Section 3 to illustrate 
the curves in the presence and absence of the transition of association in 
more realistic settings. 

We now compare the plot with the Chi-plot and the X-plot using 
a real example considered in Kallenberg and Ledwina (1999), Fisher and 
Switzer (2001), Genest and Boies (2003). This data set consists of 28 mea- 
surements of size of the annual spawning stock of salmon and corresponding 
production of new catchable-sized fish in the Skeena River. It was specu- 
lated by Fisher and Switzer (2001) to contain a mixed populations with 
heterogeneous association. Though the dissimilarity of Chi-plot or X-plot 
to their prototypical plots [cf. Fisher and Switzer (2001); Genest and Boies 
(2003)] suggests the data may involve more than simple monotone associa- 
tion [Fisher and Switzer (2001); Genest and Boies (2003)], neither of these 
plots manifest heterogeneity of association. In the curve [Figure 2(d)], 
the characteristic pattern of transition is observed at about t = 0.5, which 
indicates that the data is likely to consist of two groups, with roughly the top 
50% from a strongly associated group and the bottom 50% from a weakly 
associated group. It agrees with the speculation in Fisher and Switzer (2001). 

2.2. Inferring the reproducibility of signals. In this section we present 
a statistical model that quantifies the dependence structure and infers the 
reliability of signals. Throughout this section, we will suppose, for simplicity. 
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(c) (d) 

Fig. 2. Rank scatterplot (a), C'hi-plot (b), K-plot (c) and the change of correspondence 
curve (d) for salmon data, which consists of 28 measurements of size of the annual spawn- 
ing stock of salmon and corresponding production of new catchable- sized fish in the Skeena 
River. The curve of is produced by taking derivative on the spline that fits with 
df= 6.4. 

that we are dealing with a sample of i.i.d. observations from a population. 
Though this is in fact unrealistic in many applications, in particular, for 
the signals from genome- wide profiling (e.g., ChlP-seq experiments), where 
observations are often dependent, the descriptive and graphical value of our 
method remains, as we are concerned with first order effects. 

In general, genuine signals tend to be more reproducible and score higher 
than spurious ones. The scores on replicates may be viewed as a mixture of 
two groups, which differ in both the strength of association and the level 
of significance. Recall that in these applications, the distributions and the 
scales of scores are usually unknown and may vary across data sets. To 
model such data, a semiparametric copula model is appropriate, in which 
the associations among the variables are represented by a simple parametric 
model but the marginal distributions are estimated nonparametrically using 
their ranks to permit arbitrary scales. Though using ranks, instead of the 
raw values of scores, generally causes some loss of information, this loss 
is known to be asymptotically negligible [Lehmann (2006)]. In view of the 
heterogeneous association in the genuine and spurious signals, we further 
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model the heterogeneity of the dependence structure in the copula model 
using a mixture model framework. 

Before proceeding to our model, we first provide a brief review of copula 
models, and refer to Joe (1997) and Nelson (1999) for a modern treatment 
of copula theory. 

2.2.1. Copulas. The multivariate function C = C{ui, . . . ,Up) is called 
a copula if it is a continuous distribution function and each marginal is a uni- 
form distribution function on [0, 1]. That is, C : [0, 1]^ — ?• [0, 1], with C{u) = 
P{Ui < ui, . . . ,Up < Up), in which each Uj ~ Unif [0, 1] and u = (ui, . . . , Up). 
By Sklar's theorem [Sklar (1959)], every continuous multivariate probability 
distribution can be represented by its univariate marginal distributions and 
a copula, described using a bivariate case as follows. 

Let Xi and X2 be two random variables with continuous CDFs Fi and F2. 
The copula C of Xi and X2 can be found by making the marginal probability 
integral transforms on Xi and X2 so that 

(2.2) C{ui,U2) = F{F{\ui),F2\u2)), ^/i, € [0, 1], 

where F is the joint distribution function of {Xi,X2), Fi and F2 are the 
marginal distribution functions of Xi and X2, respectively, and F^^ and F2^ 
are the right-continuous inverses of Fi and F2, defined as F~^{u) = inf {z : 
Fj{z) > u}. That is, the copula is the joint distribution of Fi{Xi), i<2(^2)- 
These variables are unobservable but estimable by the normalized ranks 
Fni{Xi), Fn2{X2) where Fni, Fn2 are the empirical distribution functions 
of the sample. The function 6cit,t) = C{t,t) is usually referred to as the 
diagonal section of a copula C. We will use the survival function of the cop- 
ula C, C{ui,U2) = P{Ui > 1 — ni,[/2 > 1 — ^2), which describes the relation- 
ship between the joint survival function [F{xi,X2) = P{Xi > xi,X2 > X2)] 
and its univariate margins {Fj = 1 — Fj) in a manner completely analogous 
to the relationship between univariate and joint functions, as C{ui,U2) = 
F(Ff ^(ni),F2"^(u2)). The sample version of (2.2) is called an empirical cop- 
ula [Deheuvels (1979); Nelson (1999)], defined as 

(2.3) =^5^1(a;fc,i<a;(i),i,Xfc,2<2;(j),2), l<i,j<n, 
^ ^ k=i 

for a sample of size n, where a^(j) and Xj-j) 2 denote order statistics on each 
coordinate from the sample. The sample version of survival copulas follows 
similarly. 

This representation provides a way to parametrize the dependence struc- 
ture between random variables separately from the marginal distributions, 
for example, a parametric model for the joint distribution of ui and U2 
and a nonparametric model for marginals. Copula-based models are natu- 
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ral in situations where learning about the association between the variables 
is important, but the marginal distributions are assumably unknown. For 
example, the 2-dimensional Gaussian copula C is defined as 

(2.4) C{ui,U2\p) = M^-Hui),<^-Hu2)\p), 

where $ is the standard normal cumulative distribution function, $2('; 'l/c) is 
the cumulative distribution function for a bivariate normal vector (21,22) ~ 

^(0) 5 (^p 1))' ^ correlation coefficient. Modeling dependence 

with arbitrary marginals Fi and F2 using the Gaussian copula (2.4) amounts 
to assuming data is generated from latent variables (21,22) by setting xi = 
Fj~^($(2i)) and X2 = F2^{^{z2))- Note that if Fi and F2 are not continuous, 
ui and U2 are not uniform. For convenience, we assume that Fi and F2 are 
continuous throughout the text. 

2.2.2. A copula mixture model. We now present our model for quanti- 
fying the dependence structure and inferring the reproducibility of signals. 
We assume throughout this part that our data is a sample of i.i.d. bivariate 
vectors (xj^i, Xi,2)- 

We assume the data consists of genuine signals and spurious signals, which 
in general correspond to a more reproducible group and a less reproducible 
group. We use the indicator Ki to represent whether a signal i is genuine 
{Ki = 1) or spurious (Ki = 0). Let tti and vro = 1 — vri denote the proportion 
of genuine and spurious signals, respectively. Given Ki = I, we assume the 
pairs of scores for genuine (resp., spurious) signals are independent draws 
from a continuous bivariate distribution with density /i(-,-) [resp., /o(t)) 
given Ki = 0] with joint distribution Fi(-,-) [resp., Fo(-,-)]. Note, however, 
that even if the marginal scales are known, Ki would be unobservable so that 
the copula is generated by the marginal mixture (with respect to Ki), Fj = 
ttqF^ + TTiFj, where Fj is the marginal distribution of the jth coordinate 
and Fj is the marginal distribution of the corresponding fcth component. 

Because genuine signals generally are more significant and more repro- 
ducible than spurious signals, we expect the two groups to have both dif- 
ferent means and different dependence structures between replicates. We 
assume that, given the indicator Ki, the dependence between replicates for 
genuine (resp., spurious) signals is induced by a bivariate Gaussian distri- 
bution zi = (21^1,21.2) [or resp., zq = (-^0,1) -2^0,2)]- The choice of Gaussian 
distribution for inducing the dependence structure in each component is 
made based on the observation that the dependence within a component 
in the data we consider generally is symmetric and that an association pa- 
rameter with a simple interpretation, such as the correlation coefficient for 
a Gaussian distribution, is natural. 

As the scores from Fi(-,-) are expected to be higher than the scores 
from Fq{-,-), we assume zi has a higher mean than zq. Since spurious sig- 
nals are presumably less reproducible, we assume corresponding signals on 
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the replicates to be independent, that is, po = 0; whereas, since genuine sig- 
nals usually are positively associated between replicates, we expect pi > 0, 
though pi is not required to be positive in our model. It also seems natural 
to assume that the underlying latent variables, reflecting replicates, have 
the same marginal distributions. Finally, we note that if the marginal scales 
are unknown, we can only identify the difference in means of the two latent 
variables and the ratio of their variances, but not the means and variances 
of the latent variables. Thus, the parametric model generating our copula 
can be described as follows: 

Let iCj ~ Bernoulli (tti) and (zj^i, 2:4^2) be distributed as 

where po = 0, pi>Q, ctq = 1, po = 0, < pi < 1. 
Let 



0"! V 0"! / 

(2.5b) 

Ui,2^G{z,^2) = -^('-^^^] +vro$(^^,2). 

0"! V ^1 / 

Our actual observations are 

(2.5c) _ 

where Fi and F2 are the marginal distributions of the two coordinates, which 
are assumed continuous but otherwise unknown. 

Thus, our model, which we shall call a copula mixture model, is a semi- 
parametric model parametrized by ^ = (vri, pi, erf , pi) and (^1,^2). The cor- 
responding mixture likelihood for the data is 



L{9) = ll[7roho{G-HFi{x,,i)),G-HF2{x,,2))) 

i=l 

(2.6a) +^i/ii(G'"i(Fi(x,,i)),G^i(F2(x,,2)))] 

n 

(2.6b) =n[c(i^iKi),^2(x^,2))5(G'-^(Fi(x,,i)))5(G-Hi^2(x.,2)))] 

i=l 

where 

7Toho{G-Hui),G'Hu2))+TTlhi{G-Hui),G~Hu2)) 



(2.7) c{ui,U2) 



g{G-\u^))g{G~\u2)) 
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is a copula density function with 

G is defined in (2.5b) and g is the density function of G. Note that G depends 
on 0. 

Given the parameters 0, the posterior probabihty that a signal i is in the 
irreproducible group can be computed as 

(2.8) P,(/^ ^ 1 ^ ^ '°''°'f ;'AS;''^;''A';^f' » 

We estimate values for these classification probabilities by estimating the 
parameters 9 using an estimation procedure described in Section 2.2.3, and 
substituting these estimates into the above formulas. 

The idea of using a mixture of copulas to describe complex dependence 
structures is not entirely new. For example, the mixed copula model [Hu 
(2006)] in economics uses a mixture of copulas [Cmix{ui-,U2 \ (6*1, . . . ,^fc)) = 
S!Li C{ui,U2 I 9i)] to generate flexible fits to the dependence structures that 
do not follow any standard copula families. In this model, all the copulas in 
Cmix are assumed to have identical marginal distributions. In contrast, the 
copula in our model not only has mixed associations, but also allows different 
associations to occur with different marginal distributions {Fj and Fj), 
thus can be viewed as a generalization of the case with the same marginal 
distribution. In addition, our modeling goal is to cluster the observations 
into groups with homogeneous associations, instead of data fitting. This 
difference in marginal distributions calls for nonstandard estimation, which 
we expect to be efficient, as we shall see in Section 2.2.3. 

2.2.3. Estimation of the copula mixture model. In this section we de- 
scribe an estimation procedure that estimates the parameters in (2.6) and 
the membership Ki of each observation. 

A common strategy to estimate the association parameters in semipara- 
metric copula models is a "pseudo-likelihood" approach, which is described 
in broad, nontechnical terms by Oakes (1994). In this approach, the em- 
pirical marginal distribution functions Fj, after rescaling by multiplying by 
(;^) to avoid infinities, are plugged into the copula density in (2.6b), ignor- 
ing the terms involving g. The association parameters are then estimated 
by maximizing the pseudo-copula likelihood. Genest, Ghoudi and Rivest 
(1995) showed, without specifying the algorithms to compute them, that 
under certain technical conditions, the estimators obtained from this ap- 
proach are consistent, asymptotically normal, and fully efficient only if the 
coordinates of the copula are independent. 

We adopt a different approach which, in principle, leads to efficient esti- 
mators under any choice of parameters and Fi, F2. Note that the estimation 
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of the association parameter pi depends on the estimation of fii , af and vri 
due to the presence of the mixture structure on marginal distributions, which 
makes the log-hkehhood (2.6) difficult to maximize directly. Our approach 
is to estimate the parameters 9 by maximizing the log-likelihood (2.6) of 

pseudo-data G~^{j^Fij;9), where Fij = Fj{xij). 

As the latent variables zqj and zi,j in our model form a mixture distri- 
bution, it is natural to use an expectation-maximization (EM) algorithm 
[Dempster, Laird and Rubin (1977)] to estimate the parameters 9 and infer 
the status of each putative signal for pseudo-data. In our approach, we first 
compute the pseudo-data G~^(^^-Fij; ^o) from some initialization param- 
eters 9^^\ then iterate between two stages: (1) maximizing 9 based on the 
pseudo-data using EM and (2) updating the pseudo-data. The detailed pro- 
cedure is given in the supplementary materials [Li et al. (2011)]. The EM 
stage may be trapped in local maxima, and the stage of updating pseudo- 
data may not converge from all starting points. However, in the simulations 
we performed (Section 3), it behaves well and finds the global maxima, when 
started from a number of initial points. 

We sketch in the supplementary materials [Li et al. (2011), Section 2] 
a heuristic argument that a limit point of our algorithm close to the true 
value satisfies an equation whose solution is asymptotically efficient. Al- 
though our algorithm converges in practice, we have yet to show its conver- 
gence in theory. However, a modification which we are investigating does con- 
verge to the fixed point mentioned above. This work will appear elsewhere. 

2.3. Irreproducihle identification rate. In this section we derive a repro- 
ducibility criterion from the copula mixture model in Section 2.2.2 based 
on an analogy between our method and the multiple hypothesis testing 
problem. This criterion can be used to assess the reproducibility of both 
individual signals and the overall replicate outputs. 

In the multiple hypothesis testing literature, the false discovery rate (FDR) 
and its variants, including positive false discovery rate (pFDR) and marginal 
false discovery rate (mFDR), are introduced to control the number of false 
positives in the rejected hypotheses [Benjamini and Hochberg (1995); Storey 
(2002); Genovese and Wasserman (2002)]. In the FDR context, when hy- 
potheses are independent and identical, the test statistics can be viewed as 
following a mixture distribution of two classes, corresponding to whether or 
not the statistic is generated according to the null hypothesis [e.g., Efron 
(2004); Storey (2002)]. Based on this mixture model, the local false discov- 
ery rate, which is the posterior probability of being in the null component 
Lfdr(-) = (1 — 7r)/o(-)//(-), was introduced to compute the individual signifi- 
cance level [Efron (2004)]. Sun and Cai (2007) show, again for the i.i.d. case, 
that Lfdr is also an optimal statistic in the sense that the thresholding rule 
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based on Lfdr controls the marginal false discovery rate with the minimum 
marginal false nondiscovery rate. 

As in multiple hypothesis testing, we also build our approach on a mixture 
model and classify the observations into two classes. However, the two classes 
have different interpretation and representation: The two classes represent 
irreproducible measurements and reproducible measurements in our model, 
in contrast to nulls and nonnulls in the multiple testing context, respectively. 

In analogy to the local false discovery rate, we define a quantity, which 
we call the local irreproducible discovery rate, to be 



This quantity can be thought of as the a posteriori probability that a signal 
is not reproducible on a pair of replicates [i.e., (2.8)], and can be estimated 
from the copula mixture model. 



Similarly, we define the irreproducible discovery rate (IDR) in analogy to 
the mFDR, 



where I-y = {{xi^i,Xi^2) : idr(rEj^i, Xj^2) < l}, Hq and H are the CDF of den- 
sity functions ho and h = ttq/io + vri/ii, respectively. For a desired con- 



think of this procedure as giving an expected rate of irreproducible discov- 
eries no greater than a. It is analogous to the adaptive step- up procedure 
of Sun and Cai (2007) for the multiple testing case. 

This procedure essentially amounts to re-ranking the identifications ac- 
cording to the likelihood ratio of the joint distribution of the two replicates. 
The resulting rankings are generally different from the ranking of the original 
significance scores on either replicate. 

Unlike the multiple testing procedure, our procedure does not require Xij 
to be p-values; instead, be any scores with continuous marginal 

distributions. When p-values are used as scores, our method can also be 
viewed as a method to combine p- values. We compare our method and two 
commonly-used p- value combinations through simulations in Section 3. 

3. Simulation studies. 

3.1. Illustration of correspondence curves. To show the prototypical plots 
of more realistic cases, we use simulated data to compare and contrast the 
curves in presence and absence of the transition of association described in 




(2.10) 



IDR(7) = P {irreproducible \ i € I^) 



7roJj^dHoiG-HFiixi,i)),G-HF2ixi,2))) 
/,Jd//(G-i(Fi(xi,i)),G-i(F2(xi,2))) 
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Section 2.1. (Figure 3). The case where no transition occurs is illustrated 
using two single-component bivariate Gaussian distributions with homoge- 
neous association, p = [Figure 3(a)] and p = 0.8 [Figure 3(b)], respectively. 
The presence of the transition is illustrated using two two-component bi- 
variate Gaussian mixtures, whose lower ranked component has independent 
coordinates (i.e., po = 0) and the higher ranked component has positively 
correlated coordinates with pi = 1 [Figure 3(c)] and pi =0.8 [Figure 3(d)], 
respectively. 

As in the idealized example (Figure 1), the characteristic transition of 
curves is observed when the transition of association is present [Figure 3(c), 
(d)], but not seen when the data consists of only one component with homo- 
geneous association. This shows that the transition of the shape of the curve 
may be used as an indicator for the presence of the transition of association. 

3.2. Copula mixture model. We first use simulation studies to examine 
the performance of our approach. In particular, we aim to assess the accu- 
racy of our classification, to evaluate the benefit of combining information 
between replicates over using only information on one replicate, and to assess 
the robustness of our method to the violation of one of its underlying model 
assumptions. In each simulation, we also compare the performance with two 
existing methods for combining significance scores across samples. However, 
as existing combination methods can be applied to only p- values, we use p- 
values as the significance scores in the comparison, though our method can 
be applied to arbitrary scores with continous marginal distributions. Here 
we consider the scenario when the p-values are not well calibrated but are 
reflective of the relative strength of evidence that the signals are real, and 
assess the accuracy of thresholds selected by all methods of comparison. 
These simulations also provide a helpful check on the convergence of our 
estimation procedure. 

In each simulation study, we generate a sample of n pairs of signals on two 
replicates. Each pair of observed signals (Zji,Zj.2) {i = I, ■ ■ ■ ,n) is a noisy 
realization of a latent signal Zj, which is independently and identically gen- 
erated from the following normal mixture model: 

Ki ~ Bernoulli(7ri), 
Zi\Ki = kr^N{pk,ri), k = 0,...,K-l, 

(3.1) 

\ Ki — k, Zi — Zi + £ijk, j — 1, 2, 

As can easily be seen from the joint distribution of (Zj^i,Zj^2) | Ki, 

\Zi,2j \\PkJ\pkiri + u;l) tI+uI ))' 

k = 0,...,K -1, 
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Fig. 3. Behavior of correspondence curves when data consists of homogeneous and het- 
erogeneous association. From left to right, the three columns are the scatterplot of ranks, 
the curve of'^ and the curve of"^'. (a) Bivariate Gaussian distribution with p = 0. (b) Bi- 
variate Gaussian distribution with p — 0.8. (c) A mixture of two bivariate Gaussian dis- 
tributions with marginals on both coordinates as fo = A'^(0, 1) and fi — N{3,1), po = 
and pi = 1 and mixing proportion tti = 0.5. (d) A mixture of two bivariate Gaussian dis- 
tributions with marginals on both coordinates as fo — N{0,1) and fi = A'"(2, 1), po = 
and pi = 0.8 and mixing proportion tti = 0.5. The curve of is produced by taking the 
derivative on the spline that fits with df— 6.4- 
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where pk = ^i^^i i this model is in fact a reparameterization of our mo- 
del (2.5a) when K = 2: (Zj^i,Zj^2) directly corresponds to the latent Gaus- 
sian variables in (2.5a) when setting hq = 0,/ii > 0,r| + = 1, po = and 
/c = 0,1. When K >2, this model is convenient for simulating data from 
multiple components and investigating the robustness of our method to the 
violation of the assumption that the data consists of a reproducible and an 
irreproducible component. After Zj j- is simulated, we transform Zi j to a Stu- 
dent's t distribution by a probability integral transformation F^^{G{Zij)), 
where Ft^ is the cdf of t distribution with df = 5 and G is as defined in (2.5b). 
We then obtain p- values from a one-sided z-test for Hq : fi = vs Hi : fi> 0, 
and use them as the significance score Xij. This procedure is equivalent 
to applying a z-test to a t distribution, thus generating p-values that are 
not calibrated but are reflective of the relative strength of evidence that the 
signals are real. It is equivalent to letting = 1 — ^{F^^{-)) in (2.5c) 

for our model. 

With p-values as the signiflcance score, our method can also be viewed 
as a way to combine p- values for ranking signals by their consensus. The 
two most commonly-used methods for combining p-values of a set of in- 
dependent tests are Fisher's combined test [Fisher (1925)] and Stouffer's z 
method [Stouffer et al. (1949)]. In Fisher's combination for the given one- 
sided test, the test statistic Qi = — 2 ^JLj^ log(|?ij) for each pair of signals 
has the xim distribution under Hq, where pij is the p- value for the ith 
signal on the jth replicate, m is the number of studies and m = 2 here. In 
Stouffer's method, the test statistic Si = -^J2^=i^~^i^ ~ Pi,j) distri- 
bution A^(0, 1) under Hq, where <I> is the standard normal CDF. For each 
pair of signals, we compute Qi {Si, resp.) and its corresponding p-values 
{pf , resp.), then estimate the corresponding false discovery rates (FDR) by 
computing q-values [Storey (2003)] based on pf {pf , resp.), using R package 
"qvalue." FDR is estimated similarly for p-values on the individual repli- 
cates. 

For our method, we classify a call as correct (or incorrect), when a genuine 
(or spurious) signal is assigned an idr value smaller than an idr threshold. 
Correspondingly, for a call from individual replicates. Fisher's method or 
Stouffer's method, the same classification applies, when its corresponding 
g-value is smaller than the threshold. We compare the discriminative power 
of these methods by assessing the trade-off between the number of correct 
and incorrect calls made at various thresholds. 

In an attempt to generate realistic simulations, we first estimated param- 
eters from a ChlP-seq data set (described in Section 4 using the model in 
Section 2.2), then simulated the signals on a pair of replicates using the 
sampling model (3.1). We performed four simulations, SI, S2, S3 and S4, as 
follows, with simulation parameters in Table 1: 
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Table 1 

Simulation parameters and parameter estimation in the simulation studies of 100 data 
sets. Each data set consists of 10,000 pairs of observations. The simulation parameters 
are estimated from a ChlP-seq data set. In all simulations, fio =0, (Jq ~ 1 and po = 0. In 
S1-S3, TTo = 1 — TTi. S4 has a third component with p2 ~0, cfq — 1, p2 = 0.64, tt2 = 0.07 
and TTo = 1 — TTi — 7r2 . The table shows the mean and the standard deviation of the 
estimated parameters over the 100 data sets using our model 









Pi 


AH 




SI 


True parameter 


0.650 


0.840 


2.500 


1.000 




Estimated values 


0.648 (0.005) 


0.839 (0.005) 


2.524 (0.033) 


1.003 (0.024) 


S2 


True parameter 


0.300 


0.400 


2.500 


1.000 




Estimated values 


0.302 (0.004) 


0.398 (0.024) 


2.549 (0.037) 


1.048 (0.032) 


S3 


True parameter 


0.050 


0.840 


2.500 


1.000 




Estimated values 


0.047 (0.004) 


0.824 (0.026) 


2.536 (0.110) 


0.876 (0.087) 


S4 


True parameter 


0.650 


0.840 


3.000 


1.000 




Estimated values 


0.669 (0.005) 


0.850 (0.005) 


3.021 (0.031) 


1.058 (0.029) 



51 This simulation was designed to demonstrate performance when the data 
are generated from the same copula mixture model we use for estima- 
tion. Data were simulated from the model (3.1) with K = 2, using the 
parameters estimated from the ChlP-seq data set considered below. The 
resulting data contained vri = 65% signals and ttq = 35% noise. 

52 A simulation to assess performance of our method when the correlation 
between genuine signals is low. Data were simulated as in SI {K = 2), 
except that pi = 0.4 and vri = 0.3. 

53 A simulation to assess performance of our method when only a small 
proportion of real but highly correlated signals are present. Data were 
simulated as in SI {K = 2), except that vri = 0.05. 

54 Here simulation parameters were chosen to illustrate a scenario when re- 
producible noise is present in addition to random noise and real signals. 
The goal is to assess the sensitivity of our method to deviations from 
the assumption that genuine signals are reproducible and noise is irre- 
producible. Data were simulated from a three-component model (i.e., 
K = 3) using (3.1), where 7r2 = 7% reproducible noise is added as the 
third component with p2 = 0.64, /U2 = and o"! = 1, and the parameters 
for signals and random noise are as in SI, except ttq = 28% and pi = 3. 

For each parameter set, we simulated 100 data sets, each of which consists 
of two replicates with 10,000 signals on each replicate. In each simulation, we 
ran the estimation procedure from 10 random initializations, and stopped 
the procedure when the increment of log-likelihood is <0.01 in an iteration 
or the number of iterations exceeds 100. All the simulations converge, when 
starting points are close to the true parameters. The results that converge 
to the highest likelihood are reported. 
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3.2.1. Parameter estimation and calibration of IDR. In S1~S3, the pa- 
rameters estimated from our models are close to the true parameters (Ta- 
ble 1). The only exception is that cJi was underestimated when the propor- 
tion of true signals is small, tti = 0.05, a case hard to distinguish from that 
of a single component. 

The irreproducible discovery rate as a guide for the selection of the sig- 
nals needs to be well calibrated. To check the calibration of thresholds, we 
compare the actual frequency of false calls, that is, empirical FDR, with 
the estimated IDR for our method and with the (/-values for other methods 
(Figure 4, left column). 

As shown in Figure 4, the original significance scores and other combina- 
tion methods are overly conservative in their estimated FDR in all simula- 
tions, whereas our method is reasonably well calibrated in SI, S2 and S3. 
When reproducible noise is present (S4), our method slightly overestimates 
the proportion and the correlation of the real signals, and underestimates 
the empirical FDR (Figure 4-S4). This reflects that the data contains some 
artifacts that receive reproducible high scores on their original measures and 
consequently receive relatively low idr values. These artifacts are difficult to 
distinguish from genuine signals. We will compare the discriminative power 
of all four methods in the next section. 

3.2.2. Comparison of discriminative power. To assess the benefit of com- 
bining information on replicates and compare with existing methods of com- 
bining values, we compared our method with the p- values on individual 
replicates, Fisher's method and Stouffer's method, by assessing the trade-off 
between the numbers of correct and incorrect calls made at various thresh- 
olds. As a small number of false calls is desired in practice, the comparison 
focuses on the performance in this region. 

In all simulations, our method consistently identifies more true signals 
than the original significance score and the two p- value combination meth- 
ods, at a given number of false calls in the studied region. Even when re- 
producible artifacts are present (S4) or only a small proportion of genuine 
signal is present (S3), our method still outperforms all methods compared 
here. 

4. Applications on real data. 

4.1. Comparing the reproducibility of multiple peak callers for ChlP-seq 
experiments. We now consider an application arising from a collabora- 
tive project with the ENCODE consortium [ENCODE Project Consortium 
(2004)]. This project has three primary goals: comparing the reproducibility 
of multiple algorithms for identifying protein-binding regions in ChlP-seq 
data (described below), selecting binding regions using a uniform criterion 
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Fig. 4. Calibration of IDR (Left) and comparison of discriminative power (Right) in 
simulation studies. Left: Estimated error rate (x-axis: IDR for our method and FDR for 
other methods) is compared with the actual frequency of false identifications (y-axis). Right: 
The number of correct and incorrect calls made at various thresholds in simulation stud- 
ies. Incorrect calls: The number of spurious signals assigned idr values smaller than the 
thresholds (our method) or with q-values smaller than the cutoffs (other methods). Correct 
calls: The number of genuine signals assigned idr values smaller than the thresholds (our 
method) or with q-values smaller than the cutoffs (other methods). 
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for data from different sources (e.g., labs), and identifying experiment results 
in poor quality. 

We now state the background of ChlP-Seq data in more detail and refer to 
Park (2009) for a recent review. A ClilP-seq experiment is a high-throughput 
assay to study protein binding sites on DNA sequences in a genome. In 
a typical ChlP-seq experiment, DNA regions that are specifically bound 
by the protein of interest are first enriched by immunoprecipitation, then 
the enriched DNA regions are sequenced by high-throughput sequencing, 
which generates a genome-wide scan of tag counts that correspond to the 
level of enrichment at each region. The relative significance of the regions 
are determined by a computational algorithm (usually referred to as a peak 
caller), largely according to local tag counts, based on either heuristics or 
some probabilistic models. The regions whose significance are above some 
prespecified threshold then are identified. To date, more than a dozen of the 
peak callers have been published. Some common measures of significance 
are fold enrichment, p- value or g- value [Storey (2003)]. 

Though these scores may reflect the relative strength of evidence for pu- 
tative binding regions to be real, determination of a proper threshold is not 
straightforward, especially for heuristic-based scores, where arbitrary judg- 
ment often has to be involved. In fact, this difficulty could also exist for 
probabilistic-based scores if the underlying probabilistic models are inad- 
equate to capture the complexity of the data. Because tuning parameters 
for each data set are usually infeasible due to lack of ground truth, default 
thresholds are often used in practice, though they may not be the optimal 
choices for the data to be analyzed. Ideally, an objective performance as- 
sessment should reflect the behavior of peak callers instead of the effect of 
thresholds. 

Here we use the binding regions identified at untuned thresholds in a CTCF 
ChlP-seq experiment (described below) to illustrate how our method is used 
for assessing and comparing the reproducibility of peak callers when tun- 
ing thresholds are unavailable, for setting a reproducibility-based threshold 
that is appliable to both heuristic and probabilistic-based significance scores, 
and for identifying results with low reproducibility. A detailed analysis on 
a comprehensive set of ENCODE data will appear elsewhere. 

4.1.1. Description of the data. In this comparison the ChlP-seq experi- 
ments of a transcription factor CTCF from two biological replicates were 
generated from the Bernstein Laboratory at the Broad Institute on hu- 
man K526 cells. Peaks were identified in biology labs, using nine commonly 
used and publicly available peak callers, namely, Peakseq [Rozowsky et al. 
(2009)], MACS [Zhang et al. (2008)], SPP [Kharchenko, Tolstorukov and 
Park (2008)], Fseq [Boyle et al. (2008)], Hotspot [Thurman et al. (2011)], 
Erange [Mortazavi et al. (2008)], Cisgenome [Ji et al. (2008)], Quest [Valouev 
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et al. (2008)] and SISSRS [Jothi et al. (2008)], using their default significance 
measures and default parameter settings with either default thresholds (all 
peak callers except Hotspot) or more relaxed thresholds (Hotspot). Among 
them, Peakseq and SPP use g-value, MACS, Hotspot and SISSRS use p- 
value, and the rest use fold enrichment, as their significance measures. Only 
the outputs from peak callers were available for our analysis. 

The peaks generated from different algorithms have substantially different 
peak widths. SPP and SISSRS generate peaks with fixed width of 100 bp and 
40 bp, respectively; all other algorithms generate peaks with varying peak 
width (median = 130-760 bp). Because wider peaks are more likely to hit 
true binding sites by chance than shorter peaks, we normalized peak width 
by truncating the peaks wider than 40 bp down to intervals of 40 bp centered 
at the reported summits of peaks, so that reproducibility is compared on the 
same basis. The choice of 40 bp was made because the peak caller with the 
narrowest average peak width in our comparison reports peaks with a fixed 
width of 40 bp. Prior to applying our method, peaks on different replicates 
are paired as identifying the same binding region, if their coverage regions 
overlap (i.e., overlap > 1 bp). Because peaks without matches do not have 
replicate measurements and are apparently irreproducible, here we elected 
to assess reproducibility of paired peaks in our analysis. Around 23-78% of 
peaks are retained for this analysis. 

4.2. Results. 

4.2.1. Correspondence profiles. Figure 5 shows the correspondence pro- 
files for the nine peak callers. By referring to the prototypical plots in Fig- 
ure 3, five peak callers (Peakseq, MACS, SPP, Fseq and Hotspot) show the 
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Fig. 5. The change of correspondence ('^'n) along the decreasing order of significance, 
plotted for 9 peak callers on a CTCF ChlP-seq experiment from ENCODE. X-axis: The 
rank list of peaks identified on a replicate. Y-axis: 
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characteristic transition from strong association to near independence [Fig- 
ure 5(b)]. As described in Section 2.1, when heterogeneity of association is 
present, a high reproducibihty translates to late occurrence of the transition 
to a segment with a positive slope. According to how much down the rank list 
the transition is observed, the three peak callers that show the highest repro- 
ducibility on this data set are Peakseq, MACS and SPP (Figure 5). For the 
other four peak callers (Erange, Cisgenome, Quest and SISSRS), the curves 
display a less clear transition and report substantially fewer (reproducible) 
peaks. This indicates that the default thresholds for these peak callers are 
likely to be too stringent to reach the breakdown of consistency, and that 
the reported peaks have relatively low reproducibility across replicates. This 
conclusion was confirmed later by biological verification (see Section 4.2.3). 

4.2.2. Inference from the copula mixture model. We applied the copula 
mixture model to the peaks identified on the replicates for each peak caller. 
As data may consist of only one group with homogeneous association, we also 
estimated the fit using a one-component model that corresponds to setting 
TTi = 1, /xi = and o"^ = 1 in (2.5). We then tested for the smallest number of 
components compatible with the data, using a likelihood ratio test statistic 

(A = j^), where L2 and Li are the likelihood of two-component and one- 
component models, respectively. With mixture models, it is well known that 
the regularity conditions do not hold for 21og(A) to have its usual asymp- 
totic Chi-square null distribution. We therefore used a parametric bootstrap 
procedure to obtain appropriate p- values [McLachlan (1987)]. In our proce- 
dure, 100 bootstrap samples were sampled from the null distribution under 
the one component hypothesis using the parametric bootstrap, where the 
parameter estimate was obtained by maximizing the pseudo-likelihood of 
the data under the null hypothesis of the one-component model. Then p- 
values were obtained by referring to the distribution of the likelihood ratio 
computed from the bootstrap samples. Table 2 summarizes the parameter 
estimation from both models and the bootstrap results. 

Based on the likelihood ratio test, it seems that the one-component model 
fits the results from SISSRS, Quest and Cisgenome better, and the two- 
component model fits the results from other peak callers. This is consistent 
with the pattern of transition in the correspondence profiles (Figure 5). 

To select binding sites, we rank putative peaks by the values of local idr 
and compute the irreproducible discovery rate (IDR) for peaks selected at 
various local idr cutoffs using (2.10), as described in Section 2.3. We illustrate 
the IDR as a function of the numbers of top peaks (ranked by local idr) for 
all peak callers in Figure 6. For a given IDR level, one can determine the 
number of peaks to be called from this plot, regardless of what type of 
scores are used to measure the significance of peaks. For example, at 5% 
IDR, the top 27,500 peaks with the smallest local idr can be called using 
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Table 2 

Parameters estimated from the copula mixture model and the one-component model, and 
model selection for determining the number of components. (7ri,pi,/ii,(Ti) are parameters 

estimated from the copula mixture model; p is estimated from the single- component 
model. The number of components is selected using a likelihood ratio test and the p-value 
of the test statistics is determined using a parametric bootstrap approach based on 100 

bootstrap samples 





Peakseq 


MACS 


SPP 


Fseq 


Hotspot 


Cisgenome 


Erange 


Quest 


Sissrs 


TTl 


0.69 


0.84 


0.77 


0.74 


0.69 


0.85 


0.72 


0.72 


1 


Pi 


0.89 


0.89 


0.88 


0.82 


0.88 


0.65 


0.81 


0.67 


0.24 


Ml 


2.27 


2.07 


2.28 


2.12 


1.62 


2.05 


2.04 


2.01 


7.27 


cri 


0.87 


1.34 


1.05 


0.86 


0.64 


1.35 


0.90 


1.39 


0.03 


P 


0.87 


0.87 


0.86 


0.83 


0.78 


0.66 


0.80 


0.66 


0.23 


p-value 

















1 





1 


1 



MACS. Using the same reproducibility criterion, peaks can be selected for 
other peak callers similarly. 

We also compare the overall reproducibility of different peak callers us- 
ing Figure 6. For example, while Peakseq, MACS and SPP on average have 
about 3% irreproducible peaks when selecting the top 25,000 peaks, most 
of the other peak callers have already reached a much higher IDR before 
identifying the top 10,000 peaks. According to the number of peaks iden- 
tified before reaching 5% IDR, the three most reproducible peak callers on 
this data set are Peakseq, MACS and SPP, then followed by Fseq, then 
others. This result is consistent with the graphical comparison based on the 
correspondence profile (Figure 5). 
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Fig. 6. Irreproducible discovery rate (IDR) at different numbers of selected peaks, plot- 
ted at various idr cutoffs for eight peak callers on a CTCF Chip-seq experiment from 
ENCODE. Peaks are selected using local idr. X-axis: The rank list of peaks, ranked by 
local idr, Y-axis: Irreproducible discovery rate (IDR). SISSRS is not shown because its 
results are highly inconsistent and all peaks are grouped into a low correlation group. 
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4.2.3. Evaluating the biological relevance of the reproducibility assessment. 
To evaluate the biological relevance of our reproducibility assessment, we 
check the accuracy of peak identifications using external biological infor- 
mation. Because a complete list of true binding regions is not known for 
the examined data set, the accuracy of peak identifications is assessed using 
high-confidence binding motifs computationally predicted using sequence 
information [Kheradpour et al. (2007)], which is a commonly used device 
in this setting [e.g., Zhang et al. (2008); Kharchenko, Tolstorukov and Park 
(2008), among many others]. Though high-confidence motifs are not required 
to be bound and true binding sites are not required to exhibit a motif sig- 
nature, the high-confidence motif instances are assumed, standard in this 
context, to contain a representative subset of true binding regions and are 
expected to have a relatively high occurrence in high-scored ChlP-seq peaks 
[Kharchenko, Tolstorukov and Park (2008)]. We selected ChlP-seq peaks 
reported by each peak caller at various IDR thresholds, and examined the 
number of high-confidence motifs (FDR < 0.1 at the PWM threshold of 
p- value = pTi) that coincide with the reported ChlP-seq peaks (defined as 
overlap > 1 bp) (Figure 7). 

For the peak callers whose reported peaks fit the two-component model 
(i.e., Peakseq, MACS, SPP, Fseq and Hotspot), we marked the number 
of ChlP-seq peaks selected at IDR = 5%. For these algorithms, the motif 
occurrences first increase with the increase of reported ChlP-seq peaks, then 
plateau before reaching the default thresholds (Figure 7). The mark of 5% 
IDR approximately corresponds to the occurrences of the plateau, with few 
additional motif occurrences if more ChlP-seq peaks are called. 




5000 10000 15000 20000 25000 30000 35000 
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Fig. 7. The coverage of high- confidence CTCF motif at different numbers of selected 
ChlP-seq peaks, plotted at various idr cutoffs for nine peak callers on a CTCF Chip-seq 
experiment from ENCODE. The bars on the curves of Peakseq, MAC'S, SPP, Fseq and 
Hotspot show the number of peaks selected at IDR — 0. 05. No selection is made for the 
rest of the peak callers because model selection favors the one-component model for peaks 
identified by these callers. 
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On the other hand, for the peak cahers (Erange, Cisgenome, Quest and 
SISSRS) whose peaks fit the one-component model, the motif occurrence 
still shows an increasing trend at default thresholds. This confirms the ob- 
servation from correspondence curves (Figure 5) that the default thresholds 
for these peak callers are likely to be overly stringent for this data set. 

Overall, the results of this analysis agree with the assessment from our 
reproducibility comparison: the three peak calling results with the highest 
reproducibility (SPP, Peakseq and MACS) in Figure 6 show the highest 
rates of motif occurrence among all algorithms of comparison; the ones that 
are reported to be less reproducible do show lower rates of motif occurrence. 
This illustrates the potential of our method as a quality measure. 

5. Discussion. We have presented a new statistical method for measur- 
ing the reproducibility of results in high-throughput experiments and setting 
selection thresholds using a reproducibility criterion. Using simulated and 
real data, we have illustrated the potential of our method for providing re- 
producibility assessment that is not confounded with prespecified threshold 
choices, determining biologically relevant thresholds, improving the accuracy 
of signal identification, and identifying suboptimal results. 

As no assumption is made on the scale of the scores, the proposed method 
is applicable for any scoring system that produces continuous ranking to re- 
flect the relative ordering of the signals. It provides a principled way to 
select signals that are scored on heuristic measures, and complements the 
thresholds determined on individual replicates. Moreover, because consis- 
tency between replicates is an internal standard that is independent of the 
scoring schemes and comparable across data sets, the proposed reproducibil- 
ity criterion is suited for setting uniform standards for selecting signals for 
data from multiple sources, such as consortium studies. Because our mea- 
sure of consistency is not confounded by platform-dependent thresholds, 
inter-platform consistency can be assessed easily. 

Of course, reproducibility is only a necessary but not a sufficient condition 
to accuracy. If replicates are generated in the presence of a systematic bias 
that introduces false association, the threshold derived from this procedure 
may underestimate the empirical false discovery rate. Though the thresh- 
olds determined by our method show reasonable biological relevance in the 
data examined here and many other ENCODE ChlP-seq experiments (to 
appear in another manuscript), we emphasize that some cares are necessary 
to ensure that the replicates maintain the level of independence that they 
should. 

We also note that the reproducibility of outputs from a data-analytical 
method (e.g., a peak caller) on replicate samples reflects the combined prop- 
erties of the method and the samples. As the behavior of the data-analytical 
method may vary across different samples, the reproducibility assessment in 
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our example should be interpreted as being specific to the studied data set, 
instead of a general conclusion. A detailed comparison of the performance 
of peak callers has been evaluated on a comprehensive set of data and will 
appear elsewhere. 

The algorithm to implement the estimation strategy outlined in Sec- 
tion 2.2.3 is provided as supplemental material to this article. An R pack- 
age is downloadable at the following website: http://cran.r-project.org/web/ 
packages / idr / index.html . 

SUPPLEMENTARY MATERIAL 

Supplementary materials for Measuring reproducibility of high-through- 
put experiments (DOI: 10.1214/11-AOAS466SUPP; .pdf). This supplement 
consists of four parts. Part 1 describes the algorithm for estimating parame- 
ters in our copula mixture model. Part 2 provides a theoretical justification 
for the efficiency of our estimator for the proposed copula mixture model 
when n is large. Part 3 derives the properties of the correspondence curves 
in Section 2.1.1. Part 4 provides an extension of our model to the case with 
multiple replicates. 
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